Iterative Human Coding and Computational Text Analysis: Application to Assessing the Effects of Public Pressure on Policy

Devin Judge-Lord | Harvard University |



Human coding and computational text analysis are more powerful when combined in an interactive workflow. I offer a suite of exact methods that can increase the power of common hand-coding tasks by several orders of magnitude. Human coding can both inform and be aided by rule-based information extraction, iteratively structuring queries on unstructured text.

  1. Computational text analysis tools can strategically select texts for human coders, including texts representing larger samples and outlier texts of high inferential value.
  2. Preprocessing documents can speed hand-coding by extracting features like named entities and key sentences.
  3. Humans and machines can iteratively tag entities using regex tables (e.g., identifying organizations in documents).
  4. Humans and machines can iteratively group texts by key features (e.g., identifying lobbying coalitions by common policy demands).
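Step 1 above (strategically selecting texts for human coders) can be sketched as an outlier search: rank documents by how dissimilar they are from the rest of the corpus and send the least typical ones to hand-coders first. This is a minimal Python illustration under assumed simplifications (whitespace tokenization, plain TF-IDF, average pairwise cosine similarity); the function names are hypothetical, not from the author's codebase.

```python
from collections import Counter
import math

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of raw text documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency of each token
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_outliers(docs, k=1):
    """Return indices of the k documents least similar to the others,
    a crude proxy for 'outlier texts of high inferential value'."""
    vecs = tfidf_vectors(docs)
    scores = []
    for i, v in enumerate(vecs):
        sims = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        scores.append((sum(sims) / len(sims), i))
    return [i for _, i in sorted(scores)[:k]]
```

The same machinery can be inverted (selecting the documents *most* similar to many others) to find texts representing larger samples.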

Applying this method to public comments on U.S. Federal Agency rules, a sample of 10,894 hand-coded comments yields 41 million as-good-as-hand-coded comments, coded for both the organizations that mobilized them and the extent to which policy changed in the direction they sought. This large sample enables new analyses of lobbying coalitions, social movements, and policy change.


Hand-coding dynamic data

Workflow: The `googlesheets4` R package allows analyzing and improving data in real time. For example, in Figure 1:

  • The “org_name” column is populated with a guess from automated methods. As humans identify new organizations and aliases, other documents with the same entity strings are auto-coded the same way.
  • As humans identify each organization’s policy “ask,” other texts with the same ask are put in the same coalition.
  • If the organization and coalition of a comment become known, it no longer needs to be coded by hand.
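The propagation step in the bullets above is, at its core, a lookup: once a human links an alias string to a canonical organization, every other document containing that string can be auto-coded and removed from the hand-coding queue. A minimal Python sketch, with hypothetical names (the actual workflow runs through Google Sheets via `googlesheets4`):

```python
def propagate_codes(docs, known_aliases):
    """Auto-code documents whose text contains an already-identified
    alias string. `known_aliases` maps alias -> canonical org name.
    Returns (codes, todo): auto-assigned codes by document index, plus
    the indices of documents that still need a human coder."""
    codes, todo = {}, []
    for i, text in enumerate(docs):
        lowered = text.lower()
        hits = {org for alias, org in known_aliases.items()
                if alias.lower() in lowered}
        if hits:
            codes[i] = sorted(hits)  # as-good-as-hand-coded
        else:
            todo.append(i)           # still needs a human coder
    return codes, todo
```

Each pass through the hand-coding sheet grows `known_aliases`, so rerunning this function shrinks `todo`.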

Figure 1: Example Coded Comments in a Google Sheet


Regex tables to tag entities

  • Deductive: Start with databases of relevant known entities
  • Inductive: Add most frequently appearing entities to regex tables
  • Iterative: Add to regex tables as humans identify new entities or new aliases for known entities. Update data to speed hand coding.
Table 1: A Regex Table Deduced from Center for Responsive Politics Lobbying Data

Entity, followed by its pattern of pipe-delimited aliases:

  • 3M Co: 3M Co|3M Cogent|3M Health Information Systems|Ceradyne|Cogent Systems|Hybrivet Systems
  • Teamsters Union: Brotherhood of Locomotive Engineers & Trainmen|Brotherhood of Maint of Way Employ Div|New England Teamsters & Trucking Pension|Teamsters Airline Express Delivery Div|Teamsters Local 357|Teamsters Union|Western Conf of Teamsters Pension Trust
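Applying a regex table is a single pattern match per entity: each canonical name maps to a pipe-delimited alternation of its known aliases, and a document is tagged with every entity whose alternation matches. A Python sketch using two (truncated) rows from Table 1; the function name is illustrative:

```python
import re

# Rows adapted from Table 1: canonical entity -> alternation of aliases
# (Teamsters row truncated here for brevity)
REGEX_TABLE = {
    "3M Co": (r"3M Co|3M Cogent|3M Health Information Systems|"
              r"Ceradyne|Cogent Systems|Hybrivet Systems"),
    "Teamsters Union": (r"Brotherhood of Locomotive Engineers & Trainmen|"
                        r"Teamsters Local 357|Teamsters Union"),
}

def tag_entities(text, table=REGEX_TABLE):
    """Return canonical names of all entities whose alias pattern
    matches anywhere in the text (case-insensitive)."""
    return sorted(entity for entity, pattern in table.items()
                  if re.search(pattern, text, flags=re.IGNORECASE))
```

Adding a newly discovered alias means appending one more alternative to the entity's pattern, which is why the table can grow iteratively as humans code.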


Figure 2: Iteratively build regex tables. For example, the `legislators` package adds legislator name variants (e.g., “AOC”) to standard legislator names.


Results: Who Mobilizes Public Comments?

Iteratively linking comments to the organizations that wrote or mobilized them (and thus to strings that identify similar documents), I find that a small number of professional advocacy organizations mobilize the vast majority of comments. The top 100 organizations mobilized 43,938,811 comments; the top ten mobilized 25,947,612.

Table 2: The Top 5 Organizations Mobilized 20 Million Public Comments, 2005-2020

| Organization | Rules Lobbied On | Pressure Campaigns | Percent (Campaigns/Rules) | Comments | Average per Campaign |
|---|---|---|---|---|---|
| NRDC | 530 | 62 | 11.7% | 5,939,264 | 95,795 |
| Sierra Club | 591 | 110 | 18.6% | 5,111,922 | 46,472 |
| CREDO | 90 | 41 | 45.6% | 3,019,150 | 73,638 |
| Environmental Defense Fund | 111 | 31 | 27.9% | 2,849,517 | 91,920 |
| Center for Biological Diversity | 572 | 86 | 15.0% | 2,815,509 | 32,738 |
| Earthjustice | 235 | 59 | 25.1% | 2,080,583 | 35,264 |

Grouping with text reuse

Figure 3: Iteratively cluster documents with repeated text


Collapsing form letters with text reuse

Figure 4: Identifying Coalitions by the Percent of Matching Text in a Sample of Public Comments using a 10-gram Window

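The text-reuse measure behind Figure 4 can be sketched as the share of one document's 10-grams that also appear in another: form letters from the same campaign share nearly all of their 10-grams, so a high-threshold greedy pass collapses them into one cluster. A minimal Python illustration under assumed simplifications (whitespace tokenization, first-match greedy clustering); the function names and 90% threshold are hypothetical, not the author's exact implementation.

```python
def ngrams(text, n=10):
    """Set of word n-grams in a document (whitespace tokens)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def pct_matching(doc_a, doc_b, n=10):
    """Percent of doc_a's n-grams that also appear in doc_b."""
    grams_a, grams_b = ngrams(doc_a, n), ngrams(doc_b, n)
    if not grams_a:
        return 0.0
    return 100.0 * len(grams_a & grams_b) / len(grams_a)

def cluster_form_letters(docs, threshold=90.0, n=10):
    """Greedy clustering: a document joins the first cluster whose
    exemplar shares at least `threshold` percent of its n-grams."""
    clusters = []  # list of (exemplar_text, member_indices)
    for i, doc in enumerate(docs):
        for exemplar, members in clusters:
            if pct_matching(doc, exemplar, n) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((doc, [i]))
    return [members for _, members in clusters]
```

Collapsing each cluster to its exemplar is what turns millions of near-duplicate form letters into a tractable hand-coding task.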

Figure 5: Most Comments Result from Public Pressure Campaigns, 2005-2020


Iterative grouping with key phrases

  1. Humans identify groups (e.g., lobbying coalitions) of selected documents
  2. Humans copy and paste key phrases from text
  3. Machines put other documents containing those phrases in the same coalition
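The three steps above reduce to a phrase book: humans paste a verbatim key phrase and name its coalition, and the machine assigns every document containing that phrase. A minimal Python sketch with hypothetical names; the first-match rule is an assumed simplification:

```python
def assign_coalitions(docs, phrase_book):
    """`phrase_book` maps a verbatim key phrase (pasted by a human
    coder from a document) to a coalition label; any document
    containing the phrase inherits the label."""
    assignments = {}
    for i, text in enumerate(docs):
        lowered = text.lower()
        for phrase, coalition in phrase_book.items():
            if phrase.lower() in lowered:
                assignments[i] = coalition
                break  # first matching phrase wins
    return assignments
```

Documents left unassigned go back to the human-coding queue, where newly pasted phrases extend the phrase book for the next pass.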

Preprocessing tips: Digitizing documents allows humans to paste text that exactly matches machine-read strings. Summaries (e.g., `textrank`’s top 3 sentences) speed hand-coding.


Results: Coalition size and coalition success

Figure 6: Lobbying Success by Number of Supportive Comments


Public pressure to address climate change and environmental justice movements had large effects on policy documents, but a small number of national advocacy organizations dominate lobbying coalitions. When tribal governments or local groups lobby without the support of national advocacy groups, policymakers typically ignore them.


Next steps

  • Compare exact entity linking (regex tables) to probabilistic methods (`linkit`, `fastLink`, machine learning with a hand-coded training set)
  • Compare exact grouping (e.g., by policy demands) to supervised probabilistic classifiers/clustering methods
